Abstract

We use partial class memberships in soft classification to model uncertain labelling and mixtures of classes. Partial class member- 
ships are not restricted to predictions, but may also occur in reference labels (ground truth, gold standard diagnosis) for training and 
validation data. 

Classifier performance is usually expressed as fractions of the confusion matrix, like sensitivity, specificity, negative and positive 
predictive values. We extend this concept to soft classification and discuss the bias and variance properties of the extended perfor- 
mance measures. Ambiguity in reference labels translates to differences between best-case, expected and worst-case performance. 
We show a second set of measures comparing expected and ideal performance which is closely related to regression performance, 
namely the root mean squared error RMSE and the mean absolute error MAE. 

All calculations apply to classical crisp as well as to soft classification (partial class memberships as well as one-class classifiers). 
The proposed performance measures allow testing classifiers with actual borderline cases. In addition, hardening of e.g. posterior 
probabilities into class labels is not necessary, avoiding the corresponding information loss and increase in variance. 

We implemented the proposed performance measures in the R package "softclassval" which is available from CRAN and at
http://softclassval.r-forge.r-project.org.

Our reasoning as well as the importance of partial memberships for chemometric classification is illustrated by a real-world 
application: astrocytoma brain tumor tissue grading (80 patients, 37 000 spectra) for finding surgical excision borders. As borderline 
cases are the actual target of the analytical technique, samples which are diagnosed to be borderline cases must be included in the 
validation. 
Notation

Throughout this paper, we use the following symbols:

Symbol                                                  Meaning
$n$, $n_g$                                              number of samples and classes, respectively
$G \in \{1, \ldots, n_g\}$                              crisp class (label)
$g \in [0, 1]^{n_g}$                                    class membership (row) vector
[symbol illegible]                                      matrix of class memberships
$p$, $r$ instead of $g$                                 distinguish prediction and reference
$Z$ ($n_g$ ref. $\times$ $n_g$ pred.)                   confusion matrix for $n$ samples
$Z(p, r)$                                               function to calculate elements of $Z$ for single samples
[symbol illegible] ($n_g$ ref. $\times$ $n_g$ pred.)    residual confusion matrix
$\mathrm{Sens}_G^{\mathrm{operator}}$                   sensitivity wrt. class $G$, calculated using operator
$\wedge$                                                conjunction (AND-operator)
$\neg$                                                  negation (NOT-operator)


1. Introduction 
Validation of chemometric models is a crucial step: it is not
enough to train a good model, but its quality actually needs to
be demonstrated with representative test samples. Thus, one
firstly needs a plan for obtaining suitable test samples and must
decide whether e. g. cross validation is appropriate, or whether
unknown future samples are needed. An excellent discussion
of such considerations is given by Esbensen and Geladi [1].
Secondly, the performance on the basis of the results is
described by suitable quantitative measures, such as the root mean
squared error (RMSE) in calibration or sensitivity, specificity
and the like for classifiers.

This paper focusses on the second aspect, although the moti- 
vation for this study did arise from the first requirement. In the 
following section, we introduce a tumor tissue grading applica- 
tion where representative test sets need to include ambiguous 
samples, i. e. samples that according to the reference labelling 
(ground truth, gold standard diagnosis) partially belong to more 
than one class. We then show how to extend well-known classi- 
fier performance measures to work with partial class member- 
ships in the reference labels. We next apply these to our tumor 
classifier, and finally discuss some more properties of the ex- 
tended performance measures when applied to test samples that 
are unambiguously assigned to their class by the reference la- 
bels. 

1.1. Application: Grading of Astrocytoma Tissues 

We illustrate the use of partial memberships with a bio- 
spectroscopic three class classification problem. A detailed de- 
scription of the application, including experimental details and 
spectroscopic interpretation, has already been published [2]. 
Briefly, gliomas are the most common primary brain tumors. 
Among them, the astrocytomas are the largest subgroup. The
World Health Organization (WHO) distinguishes four grades of
astrocytomas according to their histology and clinical behavior
[3-5]. Astrocytomas °II tend to further de-differentiate and gain
in malignancy. Astrocytomas °III are malignant, and glioblastomas
(°IV; GBM) are the most undifferentiated gliomas.
Astrocytomas °III and GBM can originate from lower grade tumors,
or appear de novo [3, 6]. Pilocytic astrocytomas (°I) are
predominantly juvenile and clinically distinct tumors and are
not considered here.

Glioma treatment includes surgical excision, if possible. The 
complete removal of the tumor is one of the most important 
factors for the prediction of the recurrence-free survival time of 
the patient [6, 7]. Boker [8] reports that complete removal under
the surgical microscope reduces the number of tumour cells by
90-95 %, still leaving about $10^{10}$ tumour cells in the patient's
brain. Tumor surgery outside the brain often applies ample
safety margins around the tumor to ensure that all tumour cells 
are removed. This is not possible in brain surgery as the normal 
brain tissue must be preserved. An additional difficulty arises 
from the infiltrative growth of the gliomas: the tumor border 
is hardly visible. Within 2 cm distance from the solid tumour, 
still about 10% of the cells are tumour cells and even more 
than 4 cm outside the solid tumor glioma cells are found [8].
Thus, although complete removal of the tumor is desired, the 
surgeon often decides to remove only the malignant part of the 
tumor. Stereo-navigation based on pre-operative imaging such 
as magnetic resonance tomography (MRT) is used routinely to 
delineate the excision border, but the precision is limited by the 
brain shift during surgery. This constitutes the need for additional
tools that help surgeons in finding the proper excision
border in-situ and in-vivo.

The WHO grading scheme lists (morphological) properties 
of tissues. Conceptually, it is a traditional classification system 
in the sense that a set of classes is defined, and each sample be- 
longs to exactly one of the classes. The tumor-biological real- 
ity, however, is not as distinct as the WHO grading scheme and 
changes at the molecular level do not necessarily occur at the 
same time as the changes in the morphology diagnosed during 
histological grading of the tumors [9, 10]. Neuropathologists 
frequently spot areas where cells are actually in the process of 
de-differentiation, i.e. in the transition from one class to the 
next. This leads to ambiguity in the description of those areas. 

Another type of ambiguous diagnosis states that a tissue con- 
sists of a mixture of cells of different grades, e. g. tumor cells 
infiltrating normal tissue. This ambiguity can occur if the mea- 
surements spatially do not resolve cells. Diagnosis at single 
cell level, however, is not practical for intra-surgical guidance. 
The working precision of the surgeons (up to ca. 1 mm) re- 
quires corresponding spatial resolution of the diagnostic tool: 
too high spatial resolution not only means longer measurement 
times and/or undersampling but also confronts the surgeon with 
too detailed information in a time-critical situation. 

Both types of ambiguity occur in our example application, 
grading of brain tumor tissue for intra-surgical decision. Note,
however, that grading of the actually measured tissue is different
from and easier than grading of the patient's tumor. A detailed
discussion of the differences between these two distinct grad- 
ing tasks has been given in [2]. 

1.2. Crisp and Soft Classification 

Crisp classification requires each sample to belong to exactly
one of the $n_g$ pre-defined classes. This restriction can be relaxed
in two independent ways.

Multiple membership: A sample may belong to more than 
one class (or no class at all). 

Partial membership: A sample may belong partially to any 
given class. 

Multiple membership is often associated with one-class clas- 
sifiers. One-class classifiers model each class independently of 
the other classes [11-13]. While this is not the case in our appli- 
cation (the tissue classes are mutually exclusive) the reasoning 
presented here works for one-class classifiers as well. We refer
to the boundary condition of crisp classifiers that each sample
must belong to exactly one class as "closed world". "Open
world" intermediate results can be transformed into closed world
results by "winner takes all" or softmax (for partial memberships)
rules.

The second concept is in analogy to the transition from hard 
(crisp) cluster analysis to fuzzy cluster analysis: the degree of 
belonging to a class is represented by a continuous membership 
value. In the remote sensing community, the term soft classification
has already been established for such partial class memberships
[14-18], so we adopt this terminology. Also, fuzzy usually
refers to ambiguity as opposed to uncertainty, but the partial
memberships can denote both. Partial class memberships
can be treated as intermediate results which are then "hardened"
into crisp class memberships.

In chemometric modelling, the term "soft" is often used in 
different ways. Hard vs. soft modelling can refer to the amount 
of prior knowledge that is reflected in the model equations. 
While hard models fit equations that are derived from strong
assumptions or first principles (e. g. order of reaction for kinetic
studies), soft models make fewer assumptions and model
empirical approximations (e. g. fitting some sigmoid) [19, 20].
The "soft" in Soft Independent Modelling of Class Analogies 
(SIMCA) comes from this distinction. SIMCA is an estab- 
lished and widespread one-class classification model, see e. g. 
[12, 21, 22] that has also been used in the context of vibrational
spectroscopic diagnosis or distinction of biological tissues
[23, 24] and cells [25]. Varmuza and Filzmoser [21], however,
seem to use the term "soft" synonymously with one-class
classifiers and Brereton [11] defines "soft" as allowing overlap
in the (feature) space assigned to each class. "Soft" takes yet
another meaning for soft margins of support vector machines 
(SVM) where soft margins allow samples in between the mar- 
gins of the SVM in feature space, even though they are labelled 
as belonging to exactly one class [26]. 

In contrast to these soft aspects of chemometric models, we 
use the term soft in this paper with respect to class labels, and 
contrast it to crisp. The performance measures we discuss in 
the present paper therefore work regardless of these various 
soft aspects of classification models: validation usually treats 
the classifier as a black box that calculates class membership 
from the spectrum (feature vector) of a test sample. Our perfor- 
mance measures can be calculated just the same way whether 
the classifier uses hard or soft modelling, models classes independently
or in distinction from the other classes, or whether it is
an SVM with or without soft margin.

While partial memberships are widely used in cluster analy- 
sis (fuzzy c-means clustering is well established for the analy- 
sis of spectra of biological tissues [27-30]), this is not the case 
for chemometric classification. Classification addresses qualitative
questions. But qualitative analysis is usually carried out
by chemometric quantification (regression) which is then evaluated
with respect to a threshold or limit. Calibration models
adequately cover chemical composition, but are not appropriate 
for many bio-spectroscopic classification problems. 

We employ (row) vectors $g \in \{0, 1\}^{n_g}$ with the element
corresponding to the sample's class $g_G = 1$ and all other elements 0
to express the crisp class membership of a sample, which can
be combined into a membership matrix with each row corresponding
to one sample. We will use the term "crisp" label or
sample also for samples that happen to have all class memberships
either 0 or 1, and "soft" for samples where at least one
membership value is not exactly 0 or 1. Thus, there may be
(and often are) crisp samples also in a soft data set. The boundary
condition for closed-world classifiers is $\sum_{i=1}^{n_g} g_i = 1$. Partial
memberships allow the elements of $g$ to take any value between
0 and 1: $g \in [0, 1]^{n_g}$.
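
The definitions above can be illustrated with a minimal Python sketch. The three-class example, the membership values and the helper names `is_crisp` and `is_closed_world` are illustrative assumptions and are not taken from the softclassval package.

```python
import numpy as np

# Hypothetical three-class example; the values are made up for illustration.
crisp = np.array([0.0, 1.0, 0.0])  # crisp label: the element of the sample's class is 1
soft = np.array([0.0, 0.4, 0.6])   # partial memberships, each in [0, 1]


def is_crisp(g):
    """True if every class membership is exactly 0 or 1."""
    g = np.asarray(g, dtype=float)
    return bool(np.all((g == 0.0) | (g == 1.0)))


def is_closed_world(g, tol=1e-9):
    """True if all memberships lie in [0, 1] and sum to 1 (closed-world constraint)."""
    g = np.asarray(g, dtype=float)
    in_range = bool(np.all((g >= 0.0) & (g <= 1.0)))
    return in_range and abs(float(g.sum()) - 1.0) <= tol


# Membership matrix: one row per sample, one column per class.
G = np.vstack([crisp, soft])
```

A vector such as `[1, 1, 0]` (multiple membership, as a one-class classifier may produce) is crisp in this sense but violates the closed-world constraint.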

Partial memberships can arise from two different concepts: 

probability or uncertainty: This is common for predictions
like posterior probabilities, but may also be the case for
reference labels.

mixtures of the underlying classes as in homogeneous mix- 
tures or heterogeneous mixtures where the heterogeneity 
is not resolved by the measurement. In chemistry, this is 
closely related to the concept of concentration and thus to 
calibration. Non-chemical fields frequently use fuzzy set 
theory. 

In practice both aspects can arise for one and the same prob- 
lem. In biomedical applications, uncertain references arise e. g. 
from the pathologist expressing uncertainty: "there may be tu- 
mor cells between these normal cells", or from disagreement 
among a panel of pathologists. In our experiments, the transfer 
of the histological diagnosis onto the measurement of a parallel 
section is an additional source of uncertainty. Our samples also 
contain two different types of mixtures: firstly, a tissue may 
consist of cells of different cell types, while the individual cells 
are not resolved by the measurement's spatial resolution. The
second type of mixture is currently de-differentiating tissue,
e.g. cells undergoing the transition from °II to °III, which are
therefore between the ordered classes.

Partial class memberships in classification may be used and 
discussed at three levels: 

Soft predictions are widely used: posterior probabilities of 
linear or quadratic discriminant analysis (LDA and QDA) 
or logistic regression (LR); the voting proportion of k near- 
est neighbors (kNN), random forests, etc. Soft predictions 
are frequently considered an intermediate result and are 
then hardened by thresholds. 

Soft training samples can be used by methods like LR, artificial
neural networks or partial least squares discriminant
analysis (PLS-DA), but up to now this is rarely done.

Soft test samples, i. e. samples with ambiguous or uncertain 
reference (ground truth, gold standard diagnosis) are the 
topic of this paper. 

For classifier training, traditionally crisp reference labels
are enforced and/or borderline cases [footnote 1] are excluded. This
raises several issues that are avoided by allowing soft labels.
Hardening of a continuous variable implies a loss of information.
Dichotomization of a logistically distributed random variable
deletes at least 25% of the information [31]. In biomedical
spectroscopy, crisp reference labels are often enforced by
requesting the pathologist to assign the sample to exactly one
of the classes, even if the pathologist describes the sample as
currently undergoing de-differentiation or consisting of mixed
cell populations. In practice, the pathologist refuses to diagnose
certain samples unambiguously (leading to exclusion of
the sample). For other samples the written-out diagnosis contains
information about the ambiguity that is not reflected by
the assigned crisp class. The same applies to diagnoses given
by a panel of pathologists. Again, either the majority class is
used (removing the information contained in the disagreement)
or the case is excluded. For the brain tumour patients, crisp
diagnoses given by the local neuropathologist and neuropathologists
from the tumour reference center often differ more than
the written-out diagnoses. This is in accordance with hardening
as possible cause of further variance. Likewise, the results of
the panel diagnosis published by Kendall et al. [32] have higher
discrepancy for intermediate classes [2].

Footnote 1: Here, we use the terms borderline case and ambiguous sample synonymously
and exclusively with respect to the true class membership. Spectra that
are spectroscopically in between the typical spectra of classes will be referred
to as "close to the class boundary".

In any case, one either uses possibly inappropriate descrip- 
tions as reference or gold standard diagnosis, or reduces the 
available number of samples. In bio-spectroscopy, where fre- 
quently hundreds or thousands of variates are measured for tens 
of patients only, this is a critical issue. In our application, | of 
the patients and almost half of the spectra would have to be ex- 
cluded. Moreover, excluding borderline samples comes at the 
risk of overestimating class separation: throwing away all diffi- 
cult cases creates an easy problem. While such filtering and the 
corresponding output of "no certain prediction possible" are ap- 
propriate for certain analytical tasks, this is not the case in our 
application. Borderline samples are actual examples of the class 
boundaries. Excluding them from classifier training means ex- 
cluding most valuable samples. The more so in our application, 
as these are also examples of the actual target samples of the 
glioma grading technique. 

While one may also arrive at a good classifier with com- 
pletely unambiguous training data, the validation must use care- 
fully collected test samples: samples representative for the field 
of use. For our astrocytoma application these, again, are the 
borderline cases. Validation methods for soft labeled samples 
are thus even more crucial than the respective training strate- 
gies. 

The remote sensing community has been using soft classifi- 
cation for a long time to describe the mixtures due to low spa- 
tial resolution. Proposals for validation of soft classifiers are 
reported in the literature [14-16]; yet, it is still considered an 
unsolved problem [17, 18]. We critically discuss these propos- 
als below. 

2. Classifier Performance: the Confusion Matrix and fractions thereof

For convenience, we abbreviate sums of particular parts of
the confusion matrix $Z$ as follows: summation includes all possible
indices according to the conditions written in the indices.
E.g. $\sum Z_{i,P}$ stands for $\sum_{i=1}^{n_g} Z_{i,P}$ (sum all elements of column
$P$), $\sum Z_{i \neq G,P}$ means $\sum_{i \in \{1, \ldots, n_g\},\, i \neq G} Z_{i,P}$ (sum all elements except
that in row $G$ of column $P$). $\sum_n$ is the sum over all samples.

2.1. Hard Classification 

The validation results of a crisp classifier are usually tabulated
in the confusion matrix $Z$. This matrix (fig. 1a) counts
how many samples that truly belong to each class (rows) were
predicted to belong to each class (columns). In other words, a
sample belonging to class $R$ and predicted to belong to class $P$
is counted in $Z_{R,P}$. Sometimes, a notation as function of prediction
and reference is more convenient:

$$Z_{i,j} = \sum_n Z(r_i, p_j) = \sum_n r_i \wedge p_j \qquad (1)$$

where $\wedge$ stands for the AND operator which returns 1 if and
only if both reference class membership $r_i$ and prediction class
membership $p_j$ are 1, otherwise the return value is 0. In order
not to clutter up the notation, we indicate the sum over all samples
by $\sum_n$ without introducing an index for the sample. The
symbol $Z$ for the confusion matrix will imply that this sum is
already taken, whereas $Z(r_i, p_j)$ is evaluated for each sample.
The results are then summed up to give $Z_{i,j}$.
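
As a sketch of eq. 1, the crisp confusion matrix can be computed by evaluating the AND of reference and prediction for each sample and summing over the samples. The function name `crisp_confusion_matrix` and the four-sample data set are illustrative assumptions.

```python
import numpy as np


def crisp_confusion_matrix(ref, pred):
    """Sum the per-sample AND of r_i and p_j over all samples (eq. 1).

    ref, pred: (n samples x n_g classes) arrays of 0/1 class memberships.
    Returns an (n_g x n_g) matrix with rows = reference, columns = prediction.
    """
    ref = np.asarray(ref, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    # per_sample[s, i, j] = r_i AND p_j for sample s
    per_sample = ref[:, :, None] & pred[:, None, :]
    return per_sample.sum(axis=0)


# Hypothetical test set: four samples, three classes, crisp labels.
ref = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
Z = crisp_confusion_matrix(ref, pred)
```

For crisp closed-world labels, the row sums of `Z` equal the number of samples per reference class and the column sums equal the number of samples per predicted class.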

Confusion matrices are frequently pooled (e. g. the $k$ confusion
matrices obtained during one iteration of $k$-fold cross validation
are fused into one by matrix addition). Confusion matrices
yield a very detailed overview of a classifier's performance.
Frequently, the confusion matrix is further summarized by proportions
calculated thereof. These proportions (fig. 1b - 1e) answer
questions with regard to the predictive abilities of a classifier.
Different disciplines refer to these fractions differently. We
use the medical terminology [34, 35]:

Sensitivity $\mathrm{Sens}_G$: How well does the classifier recognize samples
of class $G$?

Specificity $\mathrm{Spec}_G$: How well does the classifier recognize that
a sample does not belong to class $G$?

Positive Predictive Value $\mathrm{PPV}_G$: Given the prediction is class
$G$, what is the probability that the sample truly belongs to
$G$?

Negative Predictive Value $\mathrm{NPV}_G$: Given a prediction "does
not belong to class $G$", what is the probability that the sample
truly does not belong to $G$?

Note that the predictive values are the "inverse" (as in inverse 
calibration) of sensitivity and specificity: sensitivity and speci- 
ficity report the distribution of test outcomes as function of the 
true disease status. In contrast, the predictive values give the 
distribution of true disease status as function of the observed 
test outcome. 

For users of the classifier, the predictive values are usually of 
more interest than sensitivity and specificity: patients and doc- 
tors want to know whether this particular patient is ill rather 
than whether the test can recognize ill people; manufacturers 
want to know whether a product can be sold rather than whether 
bad batches can be found. Answering these questions needs 
to take into account the prior probabilities of the classes (in 
medical diagnosis: prevalence). The relative frequencies of the 
classes in the test set (row sums of the confusion matrix Z) 
do not necessarily reflect the prior probabilities. Moreover, the 
prior probabilities can vary greatly among different populations 
(consider e. g. HIV tests for blood donors and drug addicts, 
respectively). The reported predictive values should therefore
be corrected for the different composition of test set and target
population and also specify the (assumed) composition of
the target population. The same caution applies for all measures
that combine different reference classes, such as overall
accuracy, the chance agreement needed to calculate corrected
performance values like the $\kappa$ statistic, etc.

Figure 1: Confusion matrix (a) and characteristic fractions for sum constrained multi class classifiers (b) - (e): (b) $\mathrm{Sens}_A$, (c) $\mathrm{Spec}_A$, (d) $\mathrm{PPV}_A$, (e) $\mathrm{NPV}_A$. The parts of the
confusion matrix summed as numerator and denominator for the respective fraction with respect to class A are shaded. (f) Symmetric relations
between the measures Sens(r, p), Spec(r, p), PPV(r, p) and NPV(r, p): mirroring horizontally and vertically, i. e. at the point, $(r; p) \mapsto (1 - r; 1 - p)$; mirroring at the major diagonal
$(r; p) \mapsto (p; r)$; and mirroring at the minor diagonal $(r; p) \mapsto (1 - p; 1 - r)$. All symmetry elements are with respect to the center of the
value space at (0.5; 0.5). The icons use Cartesian coordinates. (a) - (e) are reprinted from [33], with permission from Elsevier.

Usually, these performance measures relate to medical deci- 
sions whether a certain disease is present or absent. This cor- 
responds to one-class classifiers. The questions translate to the 
following expressions: 



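$$\mathrm{Sens}_G = \frac{Z_{G,G}}{\sum_n r_G} \qquad (2)$$

$$\mathrm{PPV}_G = \frac{Z_{G,G}}{\sum_n p_G} \qquad (3)$$

$$\mathrm{Spec}_G = \frac{Z_{\neg G,\neg G}}{\sum_n r_{\neg G}} \qquad (4)$$

$$\mathrm{NPV}_G = \frac{Z_{\neg G,\neg G}}{\sum_n p_{\neg G}} \qquad (5)$$
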


The membership $g_{\neg G}$ of the dummy class "$\neg G$" ("not class $G$")
is obtained as $g_{\neg G} = 1 - g_G$. The same expressions can be used
for closed-world classifiers, where the constraint in addition allows
the alternative calculation as the sum of memberships to
the other classes $g_{\neg G} = 1 - g_G = \sum_{i \neq G} g_i$, see fig. 1b - 1e.

In analogy to the addition of confusion matrices, the overall 
(multi-sample) performance is the average of the single sample 
performances weighted by the denominator variable. 

Like the confusion matrix, also the performance measures
can be written as functions, and due to the symmetry between
the performance measures (compare fig. 1f), all operators can
be expressed using one basic underlying function. Performance
measures as well as the class memberships refer to each
class independently of the other classes for both one-class and
closed-world classifiers. We drop the class index for convenience:

$$\mathrm{Sens}(r, p) = \frac{\sum_n Z(r, p)}{\sum_n r} \qquad (6)$$

$$\mathrm{Spec}(r, p) = \mathrm{Sens}(1 - r, 1 - p) \qquad (7)$$

$$\mathrm{PPV}(r, p) = \mathrm{Sens}(p, r) \qquad (8)$$

$$\mathrm{NPV}(r, p) = \mathrm{Sens}(1 - p, 1 - r) \qquad (9)$$
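
Eqs. 6 - 9 can be sketched in Python with a single basic function from which the other three measures are derived by swapping or negating the arguments. The function names, the default choice of `np.minimum` as per-sample function $Z$ (which coincides with the Boolean AND for 0/1 input) and the eight-sample data are assumptions made for illustration; this is not the softclassval implementation.

```python
import numpy as np


def sens(r, p, z=np.minimum):
    """Eq. 6 for a single class: sum_n Z(r, p) / sum_n r.

    r, p: reference and predicted memberships to the class, one entry per sample.
    z: per-sample function Z(r, p); np.minimum equals AND for crisp 0/1 input.
    """
    r = np.asarray(r, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(z(r, p).sum() / r.sum())


def spec(r, p, z=np.minimum):
    """Eq. 7: Spec(r, p) = Sens(1 - r, 1 - p)."""
    r, p = np.asarray(r, dtype=float), np.asarray(p, dtype=float)
    return sens(1.0 - r, 1.0 - p, z)


def ppv(r, p, z=np.minimum):
    """Eq. 8: PPV(r, p) = Sens(p, r)."""
    return sens(p, r, z)


def npv(r, p, z=np.minimum):
    """Eq. 9: NPV(r, p) = Sens(1 - p, 1 - r)."""
    r, p = np.asarray(r, dtype=float), np.asarray(p, dtype=float)
    return sens(1.0 - p, 1.0 - r, z)


# Hypothetical crisp test set: 2 true positives, 1 false negative,
# 1 false positive and 4 true negatives.
r = [1, 1, 1, 0, 0, 0, 0, 0]
p = [1, 1, 0, 1, 0, 0, 0, 0]
```

With these data, sensitivity and positive predictive value are both 2/3, and specificity and negative predictive value are both 4/5, in agreement with the usual crisp definitions.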



2.2. Soft Confusion Matrices 

To generalize the confusion matrix and performance mea- 
sures for soft reference and prediction, the Boolean AND- 
operator A (conjunction) in the definition of the crisp confusion 
matrix (eq. 1) is replaced by a suitable operator for continuous-valued
input in the range [0, 1]. The three main candidates
are the minimum (weak conjunction), as proposed e. g. by
Łukasiewicz and Gödel, the strong conjunction $\max(x + y - 1, 0)$
(Łukasiewicz), and the product [36-38]. These operators reflect
the ambiguity in the performance estimate due to the ambiguity 
expressed by the soft memberships. 

The rationale behind these operators can be illustrated by a 
situation where low (e. g. spatial) resolution causes the ambi- 
guity (fig 2). Say, a number of cells are in the measurement 
volume of a spectrum, and half of them are cancer cells and the 
other half are normal. The classifier yields a fraction of 0.8 for 
cancerous. In the best case, the classifier recognized the cancer- 
ous half correctly, so the conjunction (overlap) for the "cancer" 
class is 0.5. In the worst case, the classifier assigns "cancer" to 
normal cells. However, since at least 0.3 must still be assigned 
to the correct "cancer" class, the overlap is 0.3. The true over- 
lap can be anywhere between these bounds, depending on the 
true distribution of cancer cells and the distribution of cancer 
cells predicted by a high-resolution classifier. E. g. if they are 
uniformly randomly distributed, for each cell the chance that it 
is both cancerous and predicted to be cancerous is $0.5 \cdot 0.8$ =
40 %, and the expected overlap is 0.4 (middle column of fig. 2).
The top row of the supplementary figure S. 1 illustrates the three 
operators as function of reference and predicted memberships, 
fig. 3 compares them for given reference memberships. 
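
The worked example above (reference membership 0.5, predicted membership 0.8) can be reproduced with a few lines of Python. The function names are illustrative assumptions.

```python
def weak_and(r, p):
    """Weak conjunction (minimum): highest possible overlap, best case."""
    return min(r, p)


def strong_and(r, p):
    """Strong conjunction max(r + p - 1, 0): lowest possible overlap, worst case."""
    return max(r + p - 1.0, 0.0)


def product_and(r, p):
    """Product: expected overlap if reference and prediction are independent."""
    return r * p


r, p = 0.5, 0.8                # reference and predicted membership to "cancer"
best = weak_and(r, p)          # best-case overlap, 0.5 in the text
worst = strong_and(r, p)       # worst-case overlap, about 0.3 (floating point)
expected = product_and(r, p)   # expected overlap, 0.4 in the text
```

For any pair of memberships in [0, 1], the product lies between the strong and the weak conjunction, so the three values are always ordered as worst, expected, best.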

Figure 2: The soft AND-operators: hypothetical high-resolution scenarios corresponding to a low resolution situation with reference
$r_i$ = 0.5 (top row) and prediction $p_j$ = 0.8 (middle row) membership to the black class. The columns show the strong AND (worst case, $Z_{i,j} = \max(r_i + p_j - 1, 0) = 0.3$), the product AND (expected, $Z_{i,j} = r_i \cdot p_j = 0.4$) and the weak AND (best case, $Z_{i,j} = \min(r_i, p_j) = 0.5$). In each column, the overlap (bottom row) is
obtained by the classical Boolean AND: for each position, $Z_{i,j,\mathrm{pos}} = r_{i,\mathrm{pos}} \wedge p_{j,\mathrm{pos}}$; the soft conjunction is the fraction where both $r_i$
AND $p_j$ belong to the black class.

Figure 3: Behavior of the confusion matrix function $Z(r_i, p_j)$
for the three operators for reference memberships (points) $r_i$ =
1 (black), 0.75 (light gray), and 0.2 (dark gray): $Z^{\mathrm{weak}}$ (upper
bound of the parallelograms, continuous line), $Z^{\mathrm{strong}}$ (lower
bound of the parallelograms, dotted line), and $Z^{\mathrm{prod}}$ (dashed
line). For reference class membership $r_i$ = 1 all three $Z$ equal
the predicted membership $p_j$.

Figure 4: Recombination of $Z^{\mathrm{weak}}$ and $Z^{\mathrm{strong}}$ into $Z^{\mathrm{opt}}$ and
$Z^{\mathrm{pess}}$. The diagonal of $Z^{\mathrm{weak}}$ and the off-diagonal elements
of $Z^{\mathrm{strong}}$ measure the best possible performance (light gray),
while $Z^{\mathrm{strong}}$'s diagonal and $Z^{\mathrm{weak}}$'s off-diagonal elements report
the worst case performance (dark gray). The resulting confusion
matrices give the most optimistic and most pessimistic
view on the classifier's performance in accordance with the
given reference memberships and the observed predictions.

The weak conjunction is the standard AND-operator in fuzzy
logic, and has been used to compute soft confusion matrices
[15].

$$Z^{\mathrm{weak}}(r_i, p_j) = \min(r_i, p_j) \qquad (10)$$

The minimum is the highest possible overlap between predic- 
tion and reference (best case scenario in fig. 2 and upper bound 
in fig. 3). 

The strong conjunction has been introduced for soft classifier
performance by Pontius et al. [39]. It reports the lowest
possible overlap between reference and prediction (worst case
scenario in fig. 2 and lower bound in fig. 3).

$$Z^{\mathrm{strong}}(r_i, p_j) = \max(r_i + p_j - 1, 0) \qquad (11)$$

As $\max(r_i + p_j - 1, 0) = r_i - \min(r_i, 1 - p_j)$, $Z^{\mathrm{weak}}$ and $Z^{\mathrm{strong}}$ are
point symmetric about $(p = \frac{1}{2}; Z = \frac{1}{2} r)$ to each other (fig. 3).

The matrix diagonal of $Z^{\mathrm{weak}}$ reports the best possible performance
that is in accordance with the given reference and prediction.
Likewise, the off-diagonal elements are the worst possible
performance for the respective type of misclassification $R \mapsto P$.
$Z^{\mathrm{strong}}$ behaves antithetically.

Both $Z^{\mathrm{weak}}$ and $Z^{\mathrm{strong}}$ lack two properties of crisp confusion
matrices that have been identified as desirable for soft confusion
matrices [16, 17]: firstly, their marginal sums do not equal
the reference and prediction class membership vectors. Secondly,
perfect reproduction does not produce a diagonal confusion
matrix and is thus more difficult to recognize than in crisp
classification problems.

Several proposals on how to "repair" both marginal sums and
diagonal structure of $Z^{\mathrm{weak}}$ for perfect reproduction of the reference
exist [16, 17, 39]. They all distribute the remainder of the
prediction after the agreement (diagonal) has been subtracted.
Again, diagonal and off-diagonal elements of these composite
confusion matrices do not share the same interpretation. Silvan-Cardenas
and Wang [17] report that the use of soft confusion
matrices in practice appears to be restricted to the diagonal of
$Z^{\mathrm{weak}}$, $Z^{\mathrm{weak}}(r_G, p_G)$.

Interestingly, the properties of $Z^{\mathrm{weak}}$ and $Z^{\mathrm{strong}}$ have not yet
been interpreted with respect to the resulting bias of the performance
measure. Using exclusively the diagonal of $Z^{\mathrm{weak}}$ results
in a strong optimistic bias: only the "optimistic" part of
$Z^{\mathrm{weak}}$ is used. We are not aware of any application where the
corresponding pessimistic performance measures are reported,
although $Z^{\mathrm{strong}}$ as a measure of the least possible overlap is
mentioned (but not used) by Pontius et al. [39].

Using a performance measure that has by construction a
strong optimistic bias, i. e. overestimates the classifier's performance,
is clearly not appropriate for an application where
the classifier should ultimately indicate whether brain tissue is
cut out or not. Similar caution is necessary in most biomedical
and many chemometric applications.

We therefore propose to recombine $Z^{\mathrm{weak}}$ and $Z^{\mathrm{strong}}$ into
"optimistic" and "pessimistic" confusion matrices $Z^{\mathrm{opt}}$ and
$Z^{\mathrm{pess}}$ as illustrated in fig. 4. These two matrices hold the best
and worst possible performance for the observed test results, 
and can therefore be interpreted consistently without the need 
to distinguish diagonal and off-diagonal elements. Together, 
the two confusion matrices span the range of possible perfor- 
mances that is in accordance with the available reference (gold 
standard diagnosis) and the observed predictions. The ambigu- 
ity in the reference labels causes this uncertainty about the true 
performance: any performance in this range may be the true 
performance of the classifier. Due to the ambiguity or uncer- 
tainty in the reference labels, the true performance cannot be 
further narrowed down. 

We define the performance measures (eqs. 6-9) so that only
single elements from the diagonal of the confusion matrix Z are 
needed to obtain these optimistic and pessimistic bounds. 
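As a minimal sketch of how such bounds could be obtained for one class, the following Python snippet assumes the common definitions of the three operators (weak AND as the minimum, strong AND as max(r + p − 1, 0), product AND as r · p) and a sensitivity of the form "summed operator value divided by summed reference membership". The function names and the toy data are illustrative only and are not the softclassval API.

```python
def weak_and(r, p):
    """Weak AND: largest possible overlap of reference r and prediction p."""
    return min(r, p)


def strong_and(r, p):
    """Strong AND: smallest possible overlap of r and p."""
    return max(r + p - 1.0, 0.0)


def prod_and(r, p):
    """Product AND: expected overlap for independent memberships."""
    return r * p


def soft_sensitivity(ref, pred, operator):
    """Operator-based overlap, normalized by the summed reference membership."""
    overlap = sum(operator(r, p) for r, p in zip(ref, pred))
    return overlap / sum(ref)


# Hypothetical memberships of four samples in one class.
ref = [1.0, 0.5, 0.9, 0.0]
pred = [0.8, 0.5, 0.6, 0.1]

best = soft_sensitivity(ref, pred, weak_and)      # optimistic bound
expected = soft_sensitivity(ref, pred, prod_and)  # expected value
worst = soft_sensitivity(ref, pred, strong_and)   # pessimistic bound
print(worst, expected, best)
```

For any memberships in [0, 1], the strong AND never exceeds the product, which never exceeds the weak AND, so the three values are always ordered worst ≤ expected ≤ best.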

The interpretation as best and worst possible performance 
does not take into account that the validation results are actu- 
ally performance estimates. "Best" and "worst" here refer to 
the uncertainty due to the ambiguity represented by soft class 
memberships. The performance estimates are subject to addi- 
tional uncertainty due to the sampling of the actual test set and 
possible instability of the "surrogate" models computed during 
cross validation etc. However, this is outside the scope of this 
paper. 

The product has been used as AND-operator for continuous- 
valued logic as well, e. g. in Reichenbach's probability logic 
[40], and has also been discussed for soft confusion matrices 
[16, 17, 39,41]: 

$$Z^{\mathrm{prod}}(r_i, p_j) = r_i \cdot p_j \qquad (12)$$

Interpreting the class memberships as probabilities, Z prod gives 
the expected amount of coincidence for independent processes 
determining the class memberships. In the mixture interpretation,
Z prod follows from the information loss due to low (spatial)
resolution: assume crisp reference and prediction are available
at high resolution, but the location information is lost (the high
resolution data is mixed randomly). The expected confusion
matrix (normalized by the respective number of samples) in this
situation is just the product-based confusion matrix Z prod. From
a Bayesian point of view, a uniform prior is used in both inter- 
pretations. 

The marginal sums of the product-based confusion matrix 
Z prod behave like the marginal sums of the crisp confusion ma- 
trix, for closed world as well as for one-class classifiers: the 
row sums are Σ_n (r · Σ p) and the column sums are Σ_n (p · Σ r).
The sum over all elements is Σ_n (Σ p · Σ r). Specifically, both
marginal sums and the total element sum of Z prod equal the 
number of samples for closed world classifiers. However, like 
the other soft confusion matrices (except Z opt), Z prod is not
diagonal if the prediction equals the soft reference. This may
be seen as an expression of the remaining uncertainty or ambiguity
arising either from the lack of further information about the
(unresolved) distribution of the classes, or from the uncertainty 
encoded in both reference and prediction. 
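The behaviour of the marginal sums can be illustrated with a small Python sketch. It assumes that rows index the reference class and columns the predicted class, and that the matrix is accumulated as the sum of outer products of the per-sample membership vectors; the helper name and the example memberships are hypothetical.

```python
def prod_confusion(ref, pred):
    """Product-based confusion matrix: sum over samples of r_i * p_j."""
    n_classes = len(ref[0])
    z = [[0.0] * n_classes for _ in range(n_classes)]
    for r, p in zip(ref, pred):
        for i in range(n_classes):
            for j in range(n_classes):
                z[i][j] += r[i] * p[j]
    return z


# Closed world example: each membership vector sums to 1.
ref = [[1.0, 0.0, 0.0],
       [0.1, 0.9, 0.0],
       [0.5, 0.0, 0.5]]

# "Ideal" case: the prediction reproduces the soft reference exactly.
z_ideal = prod_confusion(ref, ref)

row_sums = [sum(row) for row in z_ideal]
col_sums = [sum(col) for col in zip(*z_ideal)]
total = sum(row_sums)
off_diag = sum(z_ideal[i][j] for i in range(3) for j in range(3) if i != j)
print(row_sums, col_sums, total, off_diag)
```

In this example the row sums reproduce the summed reference memberships per class and add up to the number of samples, while the off-diagonal mass stays above zero even though prediction and reference coincide.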

2.3. Calculating the Performance Measures for Soft Reference 
and Prediction. 

Eqs. (6) to (8) can directly be used with the soft confusion 
matrices. Note that the performance measures refer only to di- 
agonal elements of the confusion matrix Z. Thus, all problems 
due to the marginal sums not equaling prediction and reference 
membership vectors are avoided, including possible specificities
or negative predictive values >100% for Z weak. Z weak and
Z strong directly yield the most optimistic and most pessimistic
case. Fig. S.1 illustrates the performance measures for the three
different operators. Note that each pair of Z weak (optimistic)
and Z strong (pessimistic) leaves a quarter of the input space
completely without information: if reference and prediction are too
ambiguous, they are in accordance with any possible value of
the performance measure (interval width in fig. S.1).

For the product operator, the four characteristic measures 
simplify to prediction, 1 - prediction, reference and 1 - ref- 
erence for each sample and weighted averages thereof for the 
multi-sample performance. 
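This simplification can be checked for a single sample. The sketch below assumes that the measures of eqs. (6) to (9) are the usual ratios of the diagonal operator value to the reference or predicted membership, with the complementary class having memberships 1 − r and 1 − p; this form is an assumption made for illustration.

```latex
\begin{aligned}
\mathrm{Sens}^{\mathrm{prod}} &= \frac{r\,p}{r} = p, &
\mathrm{Spec}^{\mathrm{prod}} &= \frac{(1-r)(1-p)}{1-r} = 1-p, \\
\mathrm{PPV}^{\mathrm{prod}}  &= \frac{r\,p}{p} = r, &
\mathrm{NPV}^{\mathrm{prod}}  &= \frac{(1-r)(1-p)}{1-p} = 1-r .
\end{aligned}
```

For several samples, numerator and denominator are summed separately, which gives, for example, a sensitivity of Σ_n r p / Σ_n r, i.e. the average prediction weighted by the reference memberships.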

The Difference between Prediction and Reference. The mix- 
ture interpretation of soft memberships suggests a treatment of 
soft classification analogous to regression. 

Regression residuals measure the deviation of the prediction 
from the reference. A short inspection of the soft confusion ma- 
trices reveals that Z weak yields performance 1 if the prediction 
p equals (or exceeds) the reference r. This means that it inher- 
ently reports deviations from the ground truth or gold standard 
diagnosis (though only for too low estimates; too high estimates
are not penalized). Z strong produces a more complicated behavior
(bottom row of fig. S.1).

In contrast, the resulting performance measures for the product
operator simplify to the regression errors distributed according
to the reference memberships: analogous to the calculation
of regression residuals ε = ŷ − y, we compare the observed
confusion matrix Z prod (r, p) with the "ideal" confusion matrix for
the actual reference Z prod (r, p = r) [41]:

$$\Delta^{\mathrm{prod}} = Z^{\mathrm{prod}}(r, p) - Z^{\mathrm{prod}}(r, r) \qquad (13)$$

Just as for regression, the sign of the residuals distinguishes 
over- or underestimation. Thus we sum the absolute deviations 
rather than their signed values (squared deviations are discussed 
below). In closed world systems, every underestimation implies
the same amount of overestimation in other classes: the row
sums of Δ prod are 0.

Also, Δ measures an error, so we compute the complementary
1 − |Δ| as our performance measures refer to the correct
part of the prediction:

$$
\begin{aligned}
\mathrm{Sens}^{\mathrm{MAE}}(r, p)
  &= 1 - \frac{\sum_n \left| Z^{\mathrm{prod}}(r, p) - Z^{\mathrm{prod}}(r, r) \right|}{\sum_n r} && (14)\\
  &= 1 - \frac{\sum_n \left| r\,p - r\,r \right|}{\sum_n r} && (15)\\
  &= 1 - \frac{\sum_n r\,|p - r|}{\sum_n r} && (16)
\end{aligned}
$$



Sens^MAE uses the mean absolute error weighted by the reference
memberships (compare eq. 6). The specificity reports the
remainder of the residuals, which is attributed to samples not
belonging to the class:

$$\mathrm{Spec}^{\mathrm{MAE}}(r, p) = 1 - \frac{\sum_n (1 - r)\,|p - r|}{\sum_n (1 - r)} \qquad (17)$$



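The predictive values characterize the inverse line of thought;
consequently, deviations are distributed according to the predicted
memberships:

$$
\begin{aligned}
\mathrm{PPV}^{\mathrm{MAE}}(r, p) &= 1 - \frac{\sum_n p\,|p - r|}{\sum_n p} && (18)\\
\mathrm{NPV}^{\mathrm{MAE}}(r, p) &= 1 - \frac{\sum_n (1 - p)\,|p - r|}{\sum_n (1 - p)} && (19)
\end{aligned}
$$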


Mean Absolute Error MAE and Root Mean Squared Error
RMSE. Instead of the weighted MAEs, the respective RMSEs
can be used, e. g.:

$$\mathrm{Sens}^{\mathrm{RMSE}}(r, p) = 1 - \sqrt{\frac{\sum_n r\,(p - r)^2}{\sum_n r}} \qquad (20)$$

The MAE is more closely related to the usual error counting for 
crisp classifiers, while the RMSE is more common for regres- 
sion models. For calculating the performance of soft prediction 
and crisp reference, the mean squared error MSE is also known 
as Brier score [42]. 

MAE and RMSE are related: In general, MAE ≤ RMSE ≤
√n · MAE with the respective number of samples n. For classification,
however, no single prediction can deviate by more than
1 from the reference (and this only for crisp reference memberships):
0 ≤ MAE ≤ 1. Thus, MAE ≤ RMSE ≤ √MAE. For
soft reference, the upper bounds of both MAE and RMSE are
lower, as the maximal deviation is the greater of r and 1 − r,
respectively, for each sample. Fig. 5 illustrates the bounds for
crisp reference data as well as for our application.
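The relation between the two error measures can be tried out numerically. The following Python sketch assumes the reference-weighted forms Sens^MAE = 1 − Σ r |p − r| / Σ r and Sens^RMSE = 1 − sqrt(Σ r (p − r)² / Σ r); the function names and the toy data are illustrative and not taken from softclassval.

```python
import math


def sens_mae(ref, pred):
    """1 - MAE, with absolute deviations weighted by the reference membership."""
    num = sum(r * abs(p - r) for r, p in zip(ref, pred))
    return 1.0 - num / sum(ref)


def sens_rmse(ref, pred):
    """1 - RMSE, with squared deviations weighted by the reference membership."""
    num = sum(r * (p - r) ** 2 for r, p in zip(ref, pred))
    return 1.0 - math.sqrt(num / sum(ref))


# Hypothetical memberships: one gross error (0.2 vs. 1.0) among small deviations.
ref = [1.0, 1.0, 0.5, 0.0]
pred = [0.9, 0.2, 0.5, 0.3]

wmae = 1.0 - sens_mae(ref, pred)
wrmse = 1.0 - sens_rmse(ref, pred)
print(wmae, wrmse, math.sqrt(wmae))
```

Because every deviation lies between 0 and 1, the weighted RMSE falls between the weighted MAE and its square root. A value close to the square root points to few large errors, a value close to the MAE to many small ones.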

Inter-class performance. Δ prod and the derived performance
measures count both under- and overestimations. This yields 
the expected behavior for class-wise performance measures. 
Performance measures that summarize more than one class 
(e. g. overall accuracy) should either take care of the conse- 
quences beforehand, or they may be normalized according to 
the maximal possible error. Closed world classifiers have one 
under- and one overestimation for each misclassification, thus 
MAE ≤ 2 and RMSE ≤ √2. For one-class classification the
bounds are MAE ≤ n_g and RMSE ≤ √n_g.




Figure 5: Bounds of the RMSE for crisp reference data (black, 
shaded). Above an MAE of 0.6, the soft references further 
restrict the range MAE and RMSE can possibly take and up- 
per bounds for the sensitivity of the tumor classes N (green), 
A °II (blue), and A °III+ (red) deviate from the black maximum 
RMSE for crisply labelled data. 



2.4. Implementation and Availability 

We implemented the proposed performance measures in R 
[43] as package "softclassval". The package is released under 
GPL 3 (http://www.gnu.org/licenses/gpl.html). 

The project is hosted at http://softclassval.r-forge.r-project.org 
where the current development version, its check results and 
the source of previous versions (via the version control web 
interface) are available. The checks include unit tests to ensure
calculational correctness, which consist of ca. twice
as many lines of code as the actual function definitions.
softclassval.unittest() executes the unit tests in interactive
R sessions if package svUnit [44] is available.

Stable releases can conveniently be installed from the Com- 
prehensive R Archive Network CRAN (http://cran.r-project.org/ 
package=softclassval, both binaries and source code are available)
by executing install.packages("softclassval").
Check results from CRAN for a variety of platforms can 
be inspected at http://cran.r-project.org/web/checks/check_results_ 
softclassval.html. 

3. Application to Astrocytoma Grading 

3.1. Experimental and Data Analysis Set-Up 

Experiments and Reference Labels. We prepared cryo sections 
of our samples which were stained for reference diagnosis (for 
the classes, see classifier setup below). Raman maps were 
recorded of the adjacent side of the remaining bulk tissue on an 
evenly spaced grid with step sizes between 200 and 333 µm using
a fiber-optic probe with focus diameter of ca. 60 µm (order
of magnitude: 10³ cells). Figure 6a shows such a bulk sample
immediately before Raman measurements. 

Histological diagnosis was obtained for the parallel section 
(fig. 6b) and transferred to the measurement grid without any 





(a) bulk sample (b) detailed diagnosis (c) reference labels (d) predictions (e) legend 

Figure 6: Samples. From left to right: (a) bulk sample ready for measurement; (b) histology results for methylene-blue stained 
parallel section; (c) reference labels; (d) predictions from one of the 125 cross validation iterations. The colors for reference labels 
and prediction are obtained by mixing the respective parts of the green (class N), blue (class A°II) and red (class A°III+) colours 
which stand for the pure tissues as indicated in the legend (e). Most of the sample area is diagnosed as a mixture of tumor °II with 
normal cells (reference membership: 0.1 N, 0.9 A°II, 0.0 A°III+; blue color). On the left is an area where the pathologist expressed 
uncertainty: the tissue may be either normal leptomeninges or necrotic. This uncertainty was translated to reference labels of 0.5 N, 
0.0 A°II and 0.5 A°III+ (brown). A transition zone between these areas got intermediate reference labels (dark violetish). The cross 
validation results show some noise and decidedly lean towards necrosis for the area where the neuropathologist was uncertain. 



Table 1: Overview of the data set [2]. With kind permission of
Springer Science+Business Media.

display of the spectra (fig. 6c). Partial class memberships were
used for the reference labels where ambiguity or uncertainty
occurred. For example, tumor tissue between the classes ("A°II
to °III") was labeled as belonging half and half to the respective
classes. The diagnosis "individual tumor cells in normal tissue" 
and tissue where the histologist was not sure whether it con- 
tained tumor cells were labeled as 0.05 tumor and 0.95 normal, 
and so forth. If shape or deformation of the sample prevented 
the transfer of the diagnosis, the fractions of the respective areas 
on the reference section were used as class membership. 

Figure 7 gives an overview of the pre-processed spectra im- 
mediately before centering on the average spectrum of normal 
grey matter. Table 1 summarizes the data set. A more detailed 
discussion of the samples and data set as well as spectroscopic 
interpretation have been given in [2]. 

Classifier Set-Up. The samples include a number of different 
tissues that were combined into the three classes requested by 
the neurosurgeons: 

N (normal or non-tumor tissues): normal white matter, nor- 
mal gray matter, and small amounts of gliotic tissue. Sur- 
gically, such tissue must be preserved. For convenience, 
we refer to this class as "normal" in the text. 

A°II (low grade tumor morphology): such tissue would lead
to a diagnosis of an astrocytoma °II if it is the most
de-differentiated tissue found. Surgically, this may be thought
of as "take out if possible".

A°III+ (high grade morphology): malignant or high grade
tumor tissues, comprising °III and °IV morphologies as
well as necrotic tissue. These tissues must be excised.

Figure 7: Weighted median, 16th, and 84th percentile spectra.
Thick line: the mean grey tissue spectrum used for centring [2].
With kind permission of Springer Science+Business Media.

Note that both class boundaries are of practical importance: the 
boundary between normal and low grade tissues is the intended 
excision border. Yet, in order not to risk damage to normal brain 
tissue, the surgeons frequently have to back up to the border 
between low and high grade morphology. 

The three classes are ordered with increasing malignancy. 
Nevertheless, we model unordered classes here, and an exten- 
sion to ordered class models is outside the scope of this paper. 
We use here the same soft LR classifier as in [2]. Briefly, the 
classifier was trained using the R [43] package nnet [45]. All 
pre-processing was decided by spectroscopic knowledge, no 
data-driven steps were included and no parameter optimization 
was performed. However, we checked that a PLS projection 
[46] of the spectra onto 25 latent variables as pre-processing
for LR training did not lead to more than slight changes in the 
prediction (see supplementary figure S.2). 90% of the PLS- 
preprocessed predictions lie within ± 0.07 of the respective pre- 
dictions without PLS pre-processing. As a comparison, 90 % of 
the differences between different cross validation iterations for 
the same spectrum are within ± 0.2. The root mean squared 
difference between iterations is 2.7x the root mean squared dif- 
ference for PLS-pre-processing with 25 latent variables. 

Validation. 125 iterations of an 8-fold cross validation scheme 
were used, splitting the data patient-wise since spectra of one 
patient are not statistically independent. 
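A patient-wise split of this kind could be sketched as follows in Python. The helper name, the seed handling and the toy patient identifiers are assumptions for illustration; the original analysis was carried out in R, and this is not its code.

```python
import random


def patientwise_folds(patient_ids, k=8, seed=0):
    """Assign whole patients to k folds; return lists of spectrum indices."""
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    # Every spectrum of a patient ends up in that patient's fold.
    fold_of = {pat: i % k for i, pat in enumerate(patients)}
    folds = [[] for _ in range(k)]
    for idx, pat in enumerate(patient_ids):
        folds[fold_of[pat]].append(idx)
    return folds


# Hypothetical data: 20 patients with 10 spectra each.
patient_ids = [f"pat{n % 20}" for n in range(200)]

# 125 iterations of 8-fold cross validation, each with a new random assignment.
iterations = [patientwise_folds(patient_ids, k=8, seed=it) for it in range(125)]
print(len(iterations), [len(f) for f in iterations[0]])
```

Splitting on patient identifiers rather than on individual spectra keeps all spectra of one patient on the same side of the train/test boundary in every iteration.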

Our data set does not reflect the relevant prior probabilities, 
nor are any such data available. Therefore, we calculate only 
sensitivity and specificity and do not report predictive values. 

Software. Data analysis was performed in R [43] using the 
packages R.matlab [47] for data import, hyperSpec [48] for 
spectra handling, pls [49] for multiplicative signal correction
of co-additions of the spectra and the PLS pre-processing for 
comparison, nnet [45] for the logistic regression, and ggplot2 
[50] for graphical display. 

3.2. Results of the Astrocytoma Grading 

We report corresponding triples of one performance measure 
of all three classes separated by bars (N | A°II | A°III+). A
tabular overview of the results is available in the supplementary
material tab. S.1.

Best, Expected, and Worst Case Performance. Figure 8a shows 
the expected (product AND), best (weak AND) and worst case 
(strong AND) sensitivity and specificity of our models. Note 
that this range accounts solely for the ambiguity of the reference
data. It does not account for the uncertainty due to the
number of test cases nor for uncertainty due to model instability. 
While these uncertainties are not a topic of the present study, it 
may be noted that the standard deviations of the performance 
measures observed over the 125 iterations of the cross valida- 
tion range from 0.007 to 0.013 for the sensitivities and from 
0.004 to 0.006 for the specificities. For the unambiguously la- 
beled (crisp) samples, standard deviations between 0.005 and 
0.017 were observed. All these are much smaller than the sym- 
bol sizes in fig. 8a. 

The expected sensitivity for the intermediate tissue morphology
A°II, 0.43, is lower than the sensitivity for both normal
(0.58) and high grade (0.55) morphologies. This corresponds
to the A°II class also biologically being in between normal and
high grade, i. e. the class has two borders relevant to the
classification problem whereas normal and high grade classes have
only one relevant border. This pattern is even stronger for the
strong sensitivity (0.54 | 0.32 | 0.50), but almost vanishes for the
weak sensitivity (0.62 | 0.57 | 0.62).

A similar overall pattern is observed for the expected specificities
(0.82 | 0.69 | 0.80). Normal and high grade tissue are
rarely confused; the difficulties in the classification lie between
the consecutive classes.



The difference between strong and weak AND largely re- 
flects the amount of ambiguity in the reference labels. This be- 
comes clear by comparison with fig. 8b, where the overlap for 
ideal reconstruction of the reference data is shown. The more 
ambiguous the reference, the larger the gap between weak and 
strong performance measure: given the reference labels, the expected
sensitivity for A°II cannot exceed 0.76, whereas for N
and A°III+ 0.91 and 0.88 could be reached. The strong sensitivities
(lower bound) cannot be more than 0.86 | 0.64 | 0.82
for the three classes. The specificity is calculated with all
samples that do not belong to the class in the denominator,
and has therefore less ambiguity. Thus expected specificities
of up to 0.95 | 0.93 | 0.91 and worst-case specificities of up to
0.92 | 0.89 | 0.87 for the three classes are possible. The weak
sensitivity and specificity can always reach 1.

The A°II class has soft borders to both other classes, whereas 
there is much less ambiguity in the reference labels between 
normal and high grade tissues. Class N references are less am- 
biguous than the high grade morphologies A°III+ (fig. 8b). In 
contrast, the predictions with respect to class N reach about 
the same sensitivity and specificity as those of class A°III+ 
(fig. 8a). 

Fig. 9 compares the behaviour of the commonly used (crisp) 
sensitivity and specificity with the soft AND-operators for the 
crisp spectra. The specificity-sensitivity curves of the three 
classes were calculated using the R [43] package ROCR [51]. 
The band width illustrates the variation in model performance 
due to different composition of the training data in different iter- 
ations of the cross validation: shown are the inter quartile range 
and median (25th, 50th and 75th percentiles).

The primary output of the logistic regression models are pos- 
terior probabilities. As a post-processing step, the spectrum can 
be assigned to a class if the respective posterior probability ex- 
ceeds a given threshold (hardening the prediction). To obtain 
the specificity-sensitivity lines, this threshold is varied. 
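The hardening step and the resulting specificity-sensitivity line can be sketched in a few lines of Python. The sketch covers one class with crisp reference labels; the function name, the toy posteriors and the grid of thresholds are illustrative assumptions and do not reproduce the ROCR-based calculation used for the figure.

```python
def crisp_sens_spec(ref, posterior, threshold):
    """Harden posteriors at the given threshold, then count as usual."""
    hard = [1 if p > threshold else 0 for p in posterior]
    tp = sum(1 for r, h in zip(ref, hard) if r == 1 and h == 1)
    tn = sum(1 for r, h in zip(ref, hard) if r == 0 and h == 0)
    n_pos = sum(ref)
    n_neg = len(ref) - n_pos
    return tp / n_pos, tn / n_neg


# Hypothetical crisp reference labels and posterior probabilities for one class.
ref = [1, 1, 1, 0, 0, 0]
posterior = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]

# Varying the threshold traces out the specificity-sensitivity line.
curve = [(t, *crisp_sens_spec(ref, posterior, t))
         for t in (0.05, 0.3, 0.5, 0.7, 0.95)]
for t, sens, spec in curve:
    print(t, sens, spec)
```

Note that a posterior of 0.6 and one of 0.9 are counted identically once they exceed the threshold, which is the loss of information that the soft measures avoid.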

Hardening discards small deviations that do not cross the 
threshold, which are therefore not detected in the specificity- 
sensitivity-curve. In contrast, the soft AND-operators work 
directly on the posterior probabilities and penalize already 
small deviations from the reference. This leads to the soft 
performance values (marked +) lying below the specificity- 
sensitivity-curves. The classical calculation of specificity- 
sensitivity-curves is possible only for crisply labelled spectra. 
In order to have the same test sample basis throughout the 
graph, soft spectra were therefore excluded from all calcula- 
tions for figure 9, i. e. the soft performance measures are the 
same results marked by plus (+) signs in figure 8a. For crisp ref- 
erence, all three AND-operators yield the same result as there 
is no ambiguity (see also fig. 3). 

Hardening also influences the variance of the performance 
measure. On the one hand, if the hardening threshold is in 
the range of the predicted posterior probabilities, hardening 
will increase the variance on the performance measure. This 
causes the well-known high variance of the crisp classifier per- 
formance measures: testing with crisp class labels is described 
as a Bernoulli-process, leading to the variance of the observed 
performance σ²(p̂) = p(1 − p)/n. The soft performance measures do
not suffer from this increase in variance. The increase in variance
will usually be high for models that have rather gradual
transitions between the classes. On the other hand, for such a
model extreme thresholds will mean that (almost) all samples
are on the same side of the threshold. In this (far less common)
case, hardening actually lowers the variance. In our example,
that would be e.g. if the A°II class were operated at sensitivity
of 0.95 with a specificity of 0.10. Operating a classifier at such
an extreme working point actually requires an adapted training
strategy, including also an adequate composition of the training
set.

Figure 8: Results for the astrocytoma grading. (a) Expected (product; blue triangle), best (weak AND; red square) and worst case
(strong AND; green circle) overlap between predicted and reference averaged over all spectra and iterations. Black crosses mark the
(coinciding) results of all three operators for the crisp reference spectra. (b) The (hypothetical) results for ideal reproduction of the
reference labels are shown (open symbols). (c) Performance based on the difference between prediction and reference: 1 − wMAE
(circles) and 1 − wRMSE (squares). The open circles are 1 − √wMAE, the lower bound of 1 − wRMSE given the observed wMAE.

Figure 9: Specificity-sensitivity-diagram for hardened predictions
of the crisp data. ×: hardening with threshold |. +: soft
performance. As only crisply labelled spectra are included, all
soft AND-operators coincide. Colors: green: N, blue: A°II,
red: A°III+. The bands show the inter quartile range over the
125 cross validation iterations; thick line: median. The ROC-like
lines are produced by varying the hardening threshold.
Hardening discards small deviations that are penalized by the
soft measures: + are below the specificity-sensitivity-curves.
The contours behind each × and + contain ca. 50 % of the
observed values over the 125 iterations: hardening considerably
increases the variance.

Note, however, that extreme thresholds for models that in 
fact predict intermediate posterior probabilities are fundamen- 
tally different from models with very sharp class transition: if 
the transition between the classes is immediate, hardening has 
hardly any effect on the variance, as the predicted posterior 
probabilities are already close to 0 or 1.

In our data, assigning each test sample completely to the 
class with the highest posterior probability (threshold — = |, 
working points marked ×), we observe crisp sensitivities and
specificities of (0.86 | 0.66 | 0.81) and (0.83 | 0.67 | 0.86), respectively,
a typical choice of working points close to the major diagonal
of the specificity-sensitivity-diagram. The corresponding
variances of the soft sensitivities and specificities (marked
+) are (1.64 | 0.64 | 0.52) × 10⁻⁴ and (0.16 | 0.36 | 0.21) × 10⁻⁴
whereas the variances of the crisp performance measures are
(2.84 | 4.13 | 0.85) × 10⁻⁴ and (0.31 | 2.17 | 0.68) × 10⁻⁴. In other
words, this default hardening increases the variance by about 60 -
550 % (sensitivity of A°III+ and A°II, respectively).
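The quoted range follows directly from the reported variances, as the short Python calculation below shows. The numbers are those given in the text, in units of 10⁻⁴ and in the class order N, A°II, A°III+; the variable names are illustrative.

```python
# Variances over the 125 iterations, in units of 1e-4 (N, A°II, A°III+).
soft_var = {"sens": [1.64, 0.64, 0.52], "spec": [0.16, 0.36, 0.21]}
crisp_var = {"sens": [2.84, 4.13, 0.85], "spec": [0.31, 2.17, 0.68]}

# Relative increase in variance caused by hardening, in percent.
increase_pct = {
    m: [100.0 * (c / s - 1.0) for s, c in zip(soft_var[m], crisp_var[m])]
    for m in soft_var
}

all_increases = increase_pct["sens"] + increase_pct["spec"]
smallest = min(all_increases)  # sensitivity of A°III+
largest = max(all_increases)   # sensitivity of A°II
print(round(smallest), round(largest))
```

The smallest and largest increase both occur for the sensitivities, of A°III+ and A°II respectively, in line with the range stated above.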

Hardening can thus be seen as a noise reduction technique 
that can be beneficial for the predictive performance. However, 
for the measurement of performance, information is lost. This 
loss becomes important in optimization of classifiers: the opti- 
mizer will not be able to distinguish well between slightly dif- 
ferent models using any target function on the basis of hardened 
predictions. Even worse, the most difficult class, A°II, is most 
affected by this avoidable increase in variance. 

Figure 8c gives the results for the calibration-type performance
measures. These follow the same general patterns already
discussed for the direct application of the different AND-operators:
specificity is higher (0.65 | 0.58 | 0.65; 1 − wMAE)
than sensitivity (0.83 | 0.73 | 0.82) and the low grade morphologies
are the most difficult class. Again, the standard deviation
observed over the 125 iterations is lower than the symbol size:
between 0.004 and 0.013 for 1 − wMAE and between 0.006
and 0.012 for 1 − wRMSE. All 1 − wRMSE but the specificity
for normal tissue are closer to the 1 − wMAE than to the
1 − √wMAE. This indicates small deviations from the reference
for many samples rather than few grossly misclassified
samples.

4. Summary and Conclusions 

We propose a set of performance measures (sensitivity, speci- 
ficity, predictive values, hit- or error rates, etc.) that can be 
calculated for classifiers with continuous outcome without the 
need of "hardening". 

Ambiguity of the reference of cases that are borderline ac- 
cording to the ground truth or gold standard diagnosis leads 
to ambiguity in the measured performance as well. The pro- 
posed measures reflect this as worst case (strong AND), best 
case (weak AND) and expected (product AND) performance. 

Deviations from the reference can also be evaluated using 
weighted versions of well-known calibration performance mea- 
sures, namely the weighted mean absolute error wMAE and the 
weighted root mean squared error wRMSE. Comparing the two
additionally allows distinguishing situations with many small
deviations from situations with few large errors.

Our new measures improve their classical counterparts in 
four different ways: 

1. While classification assumes perfectly distinct classes, in 
reality this is often not the case. In the past, samples 
with ambiguous reference labels (borderline cases) usually 
were excluded completely from both classifier training and 
testing, or their reference labels were hardened. This has 
serious consequences. Excluding borderline cases from
classifier training can lead to overestimation of class separability,
while hardening the class labels of samples truly in
between the classes (e. g. mixed cell population or cell
population currently undergoing de-differentiation) will
actually drive the model to overestimate class separability.
However, as the classical performance measures do
not allow evaluating the model performance for borderline
cases, this overoptimistic modeling of class separation
could not be detected; the same is true if the reference
labels were hardened. If the predictions are hardened as
well, under-estimation of class separability cannot be
detected either. Excluding borderline cases leaves the validation
completely blind for the behaviour of the classifier close 
to the class boundaries defined by the reference. Hard- 
ened reference labels probe this region, but high variance 
results. In contrast, the soft performance measures penal- 
ize over- or underestimation of class separability and have 
lower variance than their hardened counterparts. They thus 
open the way for more realistic modeling of class bound- 
aries that uses also borderline training cases. 

2. Truly ambiguous samples may be the actual target of a 
classifier, such as in our example of astrocytoma grad- 
ing for surgical guidance. In that case, borderline cases 
are the most important test samples, since testing clear cases
only cannot be considered representative for the applica- 
tion. The proposed soft measures work with samples diag- 
nosed as borderline cases and thus allow for more realistic 
classifier testing. 



3. Hardening of the outcome has (like other dichotomization 
approaches) been criticized due to the inherent loss of in- 
formation. This causes difficulties when classifier perfor- 
mance is compared. Model optimization relies on detect- 
ing already small differences in the predictive ability of 
the models which is thwarted by the hardening. In con- 
trast, the soft performance measures already report small 
deviations from the reference and thus allow to differenti- 
ate between more similar models than their crisp counter- 
parts. 

4. For models with gradual class transitions (i.e. that actually 
predict intermediate posterior probability values), typical 
hardening thresholds lead to an increase in variance that is 
avoided by the soft performance measures. For our astro- 
cytoma grading, the sensitivity and specificity based on the 
soft AND-operators show between 39 and 84 % less vari- 
ance over the 125 iterations of the 8-fold cross validation 
than sensitivity and specificity based on crisp classifica- 
tion. 
Supplementary Material



Table S.1: Results for the Astrocytoma Grading using soft LR.

(a) The different AND-operators. The last three columns ("ideal") give the best possible performance that can be obtained with the given
reference, i. e. the result if the prediction equals the reference memberships.
(b) The regression-type operators.
Figure S.1: The values (color) of the three operators (columns) and the soft performance measures (rows) for a single sample as
function of reference and prediction. The 4th column (red) gives the width of the interval between strong and weak measures. The
interval ranges from 0 to 1 for the triangle between the side where the crisp measure is not defined and the center of the input space
(r = 0.5; p = 0.5). Note the symmetry between the performance measures (compare also fig. 1f). Z prod is similar to Z weak for small
values and similar to Z strong for high values.

The last row gives the MAE-version of the sensitivity.
Figure S.2: Distribution of differences in the prediction between classifiers with and without 25 LV PLS as pre-processing (LR vs.
PLS-LR, same iteration). To better compare the results, the distribution of differences between the iterations of the LR (without
PLS) is shown as well (dotted line). 90 % of the PLS-preprocessed predictions lie within ± 0.07 of the respective predictions
without PLS pre-processing, whereas the 5th to 95th percentile of the between-iteration differences range from -0.2 to +0.2.